1 Setup and Loading Libraries

knitr::opts_chunk$set(echo = TRUE, include = TRUE, fig.align = "center")
options(scipen = 999, knitr.kable.NA = "")

# Install libraries (if not already installed)
##install.packages("tidyverse")
##install.packages("lubridate")
##install.packages("readxl")
##install.packages("skimr")
##install.packages("magrittr")
##install.packages("tidyquant")
##install.packages("tsibble")
##install.packages("feasts")
##install.packages("ggcorrplot")
##install.packages("glmnet")
##install.packages("caret")
##install.packages("rattle")

# Importing libraries
library(tidyverse)
library(lubridate)
library(readxl)
library(skimr)
library(magrittr)
library(tidyquant)
library(tsibble)
library(feasts)
library(ggcorrplot)
library(glmnet)
library(caret)
library(rattle)

set.seed(1234)

2 Import Data

Electricity price, consumption and production data are imported from a CSV file.

US dollars are chosen as the prediction currency because the US dollar is one of the most widely used, generally accepted and stable currencies in the world. Other currencies, especially TRY, may fluctuate greatly over the years due to factors irrelevant to this analysis, such as inflation and interest rates.

df <- read_csv("mcp_with_variables.csv") %>% select(-mcp_try, -mcp_euro)
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   date = col_date(format = ""),
##   hour = col_double(),
##   mcp_try = col_double(),
##   mcp_dollars = col_double(),
##   mcp_euro = col_double(),
##   load_plan = col_double(),
##   total_prod = col_double(),
##   natural_gas = col_double(),
##   wind = col_double(),
##   lignite = col_double(),
##   black_coal = col_double(),
##   import_coal = col_double(),
##   fuel_oil = col_double(),
##   geothermal = col_double(),
##   dam = col_double(),
##   naphta = col_double(),
##   biomass = col_double(),
##   river = col_double(),
##   other = col_double()
## )
head(df)
## # A tibble: 6 x 17
##   date        hour mcp_dollars load_plan total_prod natural_gas  wind lignite
##   <date>     <dbl>       <dbl>     <dbl>      <dbl>       <dbl> <dbl>   <dbl>
## 1 2015-06-01     0        39.0     25600     20657.       6198.  169.    2799
## 2 2015-06-01     1        33.1     23900     20292.       6391.  166.    2799
## 3 2015-06-01     2        18.4     23000     19666.       6403.  172.    2799
## 4 2015-06-01     3        12.7     22600     19650.       6388.  182.    2799
## 5 2015-06-01     4        10.9     22500     19664.       6300.  189.    2799
## 6 2015-06-01     5        11.3     21900     20028.       6591.  196.    2799
## # … with 9 more variables: black_coal <dbl>, import_coal <dbl>, fuel_oil <dbl>,
## #   geothermal <dbl>, dam <dbl>, naphta <dbl>, biomass <dbl>, river <dbl>,
## #   other <dbl>

3 Descriptive Analysis

Summary statistics of the variables are shown.

mcp_dollars has some extreme values (p100 = 542). Also, the histograms suggest that load_plan and total_prod have similar distributions.

skim(df)
Data summary
Name df
Number of rows 53112
Number of columns 17
_______________________
Column type frequency:
Date 1
numeric 16
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2015-06-01 2021-06-21 2018-06-11 2213

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
hour 0 1 11.50 6.92 0.00 5.75 11.50 17.25 23.00 ▇▇▆▇▇
mcp_dollars 0 1 45.20 14.36 0.00 38.54 45.70 54.66 542.00 ▇▁▁▁▁
load_plan 0 1 32657.61 5290.13 0.00 28723.75 32800.00 36500.00 46500.00 ▁▁▃▇▃
total_prod 0 1 29530.99 4863.54 934.26 25849.19 29505.54 33134.57 42814.23 ▁▁▃▇▂
natural_gas 0 1 8817.98 3684.93 13.50 6434.35 9044.28 11381.06 17928.17 ▂▅▇▅▁
wind 0 1 1964.53 1453.06 14.99 769.02 1640.57 2887.93 7486.13 ▇▅▃▁▁
lignite 0 1 3968.60 705.54 0.00 3514.00 4008.00 4464.15 6806.10 ▁▁▇▇▁
black_coal 0 1 393.35 310.50 0.00 139.00 286.00 662.00 1179.00 ▇▃▅▃▁
import_coal 0 1 5607.55 1552.21 330.00 4520.00 5590.00 7006.20 8112.00 ▁▂▇▇▇
fuel_oil 0 1 109.17 182.38 0.00 1.10 47.60 115.00 1008.00 ▇▁▁▁▁
geothermal 0 1 641.32 302.12 7.80 450.25 679.80 876.96 1173.94 ▃▅▇▇▆
dam 0 1 5687.11 2815.37 23.00 3490.68 5552.30 7560.60 15942.88 ▅▇▆▂▁
naphta 0 1 1.67 2.28 0.00 0.00 0.00 4.50 12.14 ▇▂▂▁▁
biomass 0 1 238.81 178.79 0.00 93.94 234.52 378.17 663.21 ▇▂▃▃▂
river 0 1 1825.34 1244.21 113.24 879.07 1403.61 2633.03 6016.68 ▇▅▃▁▁
other 0 1 275.55 118.25 0.00 220.30 265.80 340.00 991.47 ▂▇▁▁▁

The histogram and box plot of the mcp_dollars variable are shown.

ggplot(df, aes(x=mcp_dollars)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

boxplot(df$mcp_dollars)
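The extreme values flagged in the box plot above can also be counted directly. A minimal sketch, using synthetic data rather than the actual prices, with the same 1.5 * IQR rule that boxplot() uses to flag outliers:

```r
# Synthetic data: mostly typical values plus one extreme observation
x <- c(seq(10, 80, by = 0.1), 542)

# Upper fence of the 1.5 * IQR rule (same rule boxplot() applies)
q <- quantile(x, c(0.25, 0.75))
upper_fence <- unname(q[2] + 1.5 * (q[2] - q[1]))

sum(x > upper_fence)   # 1: only the extreme value is flagged
```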

Histograms of the load_plan and total_prod variables are shown. Both variables are distributed similarly and show characteristics of a normal distribution around a mean of about 30000.

ggplot(df, aes(x=load_plan)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(df, aes(x=total_prod)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Box plots of the load_plan and total_prod variables are shown.

boxplot(df$load_plan)

boxplot(df$total_prod)

Seasonality plots of the load_plan variable are shown for hours 0, 4, 8, 12, 16 and 20. There is generally some yearly trend pattern; however, a different seasonal pattern also appears over shorter periods such as a week or a month.

as_tsibble(df %>% filter(hour == 0)) %>% gg_season(load_plan) + theme_tq()
## Using `date` as index variable.

as_tsibble(df %>% filter(hour == 4)) %>% gg_season(load_plan) + theme_tq()
## Using `date` as index variable.

as_tsibble(df %>% filter(hour == 8)) %>% gg_season(load_plan) + theme_tq()
## Using `date` as index variable.

as_tsibble(df %>% filter(hour == 12)) %>% gg_season(load_plan) + theme_tq()
## Using `date` as index variable.

as_tsibble(df %>% filter(hour == 16)) %>% gg_season(load_plan) + theme_tq()
## Using `date` as index variable.

as_tsibble(df %>% filter(hour == 20)) %>% gg_season(load_plan) + theme_tq()
## Using `date` as index variable.

A four-week line graph of the MCP price for every hour is shown to check for seasonality in the variable. There is some weekly seasonality for certain hours, but no general seasonality across the whole day.

df %>%
  filter(date >= as_date("2019-01-07") & date <= as_date("2019-02-04")) %>% 
  ggplot(aes(x = date, y = mcp_dollars)) + 
  geom_line(aes(color = as.factor(hour)), size = 0.9, show.legend = T) + 
  labs(title="", y="MCP Price (USD)", x="Date") + 
  facet_wrap(~hour, ncol=4) + 
  theme_tq()

An hourly MCP price line graph for every day of one week is shown to check for seasonality in the variable. Some days of the week show similar patterns; however, there is no exact pattern repeated every day.

df %>%
  filter(date >= as_date("2019-01-07") & date <= as_date("2019-01-13")) %>% 
  ggplot(aes(x = hour, y = mcp_dollars)) + 
  geom_line(aes(color = as.factor(date)), size = 0.9) + 
  labs(title="", y="MCP Price (USD)", x="Hour") + 
  theme_tq()

A four-week line graph of electricity usage (load_plan) for every hour is shown to check for seasonality in the variable. There is a clear weekly seasonality pattern: the same day of each week shows a similar pattern, and on Sundays in particular there is a sharp decrease in electricity usage.

df %>%
  filter(date >= as_date("2019-01-07") & date <= as_date("2019-02-04")) %>% 
  ggplot(aes(x = date, y = load_plan)) + 
  geom_line(aes(color = as.factor(hour)), size = 0.9, show.legend = T) + 
  labs(title="", y="Electricity Usage", x="Date") + 
#  facet_wrap(~hour, ncol=4) + 
  theme_tq()

An hourly electricity usage (load_plan) line graph for every day of one week is shown to check for seasonality. There is also a clear daily seasonality pattern: the same hour of each day shows a similar pattern, and usage during the night hours is low, as expected.

df %>%
  filter(date >= as_date("2019-01-07") & date <= as_date("2019-01-13")) %>% 
  ggplot(aes(x = hour, y = load_plan)) + 
  geom_line(aes(color = as.factor(date)), size = 0.9) + 
  labs(title="", y="Electricity Usage", x="Hour") + 
  theme_tq()

A four-week line graph of total electricity production (total_prod) for every hour is shown to check for seasonality. It shows similar characteristics to the electricity usage (load_plan) variable.

df %>%
  filter(date >= as_date("2019-01-07") & date <= as_date("2019-02-04")) %>% 
  ggplot(aes(x = date, y = total_prod)) + 
  geom_line(aes(color = as.factor(hour)), size = 0.9, show.legend = T) + 
  labs(title="", y="Total Electricity Production", x="Date") + 
#  facet_wrap(~hour, ncol=4) + 
  theme_tq()

An hourly total electricity production (total_prod) line graph for every day of one week is shown to check for seasonality. It shows similar characteristics to the electricity usage (load_plan) variable.

df %>%
  filter(date >= as_date("2019-01-07") & date <= as_date("2019-01-13")) %>% 
  ggplot(aes(x = hour, y = total_prod)) + 
  geom_line(aes(color = as.factor(date)), size = 0.9) + 
  labs(title="", y="Total Electricity Production", x="Hour") + 
  theme_tq()
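The daily and weekly seasonality seen visually in the plots above can also be quantified numerically. A minimal sketch, using a synthetic hourly series rather than the real load data: the autocorrelation at lag 24 measures daily seasonality and at lag 168 (one week of hours) measures weekly seasonality.

```r
set.seed(1)
h <- 0:(24 * 7 * 8 - 1)                            # 8 weeks of hourly points
load <- 30000 + 3000 * sin(2 * pi * h / 24) +      # daily cycle
                2000 * sin(2 * pi * h / 168) +     # weekly cycle
                rnorm(length(h), sd = 500)         # noise

a <- acf(load, lag.max = 168, plot = FALSE)
a$acf[c(24, 168) + 1]   # high values at both lags confirm seasonality
```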

4 Data Manipulation

New variables, such as lagged variables and ratio variables, are created to normalize the seasonality/trend effects. The renew_energy_ratio variable shows the share of production supplied by renewable energy sources, and the cons_prod_ratio variable shows how much of demand is met by total production. Finally, weekly (lag 168 hours) and monthly (lag 672 hours, i.e. 28 days) lagged versions of these two new variables and of the target variable are created.

df2 <- df %>%
  mutate(renew_energy_ratio = (wind + geothermal + dam + biomass + river) / total_prod,
         cons_prod_ratio = load_plan / total_prod,
         mcp_dollars_lag168 = lag(mcp_dollars, 168),
         renew_energy_ratio_lag168 = lag(renew_energy_ratio, 168),
         cons_prod_ratio_lag168 = lag(cons_prod_ratio, 168),
         mcp_dollars_lag672 = lag(mcp_dollars, 672),
         renew_energy_ratio_lag672 = lag(renew_energy_ratio, 672),
         cons_prod_ratio_lag672 = lag(cons_prod_ratio, 672)) %>% 
  select(date, hour, mcp_dollars, mcp_dollars_lag168, renew_energy_ratio_lag168, cons_prod_ratio_lag168, 
         mcp_dollars_lag672, renew_energy_ratio_lag672, cons_prod_ratio_lag672) %>% 
  filter(date >= as_date("2015-06-29"))
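A quick sanity check on how dplyr::lag() aligns the lagged columns above: the first k rows of a lag-k column are NA, which is why rows before 2015-06-29, the first date with a full 672-hour history, are dropped. A minimal sketch with a toy series and a lag of 2 periods for illustration:

```r
library(dplyr)

x <- tibble(t = 1:5, v = c(10, 20, 30, 40, 50)) %>%
  mutate(v_lag2 = lag(v, 2))   # each row gets the value from 2 periods earlier

x$v_lag2   # NA NA 10 20 30: the first 2 rows have no history
```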

Variable distribution, scatter and histogram plots are drawn for one week of data with the selected variables.

GGally::ggpairs(df2 %>% filter(date >= as_date("2019-01-07") & date <= as_date("2019-01-13")) %>% select(-date, -hour)) + theme_tq()
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Train and test data sets are created.

df2_train <- df2 %>% filter(date < as_date("2021-05-22"))
df2_test <- df2 %>% filter(date >= as_date("2021-05-22"))
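Because the data are a time series, the split above is chronological rather than random. A minimal sketch with synthetic dates, including a check that the training period ends before the test period begins (no leakage across the split):

```r
dates  <- seq(as.Date("2021-05-01"), as.Date("2021-05-31"), by = "day")
cutoff <- as.Date("2021-05-22")

train_dates <- dates[dates < cutoff]    # everything before the cutoff
test_dates  <- dates[dates >= cutoff]   # cutoff day onwards

c(length(train_dates), length(test_dates))   # 21 train days, 10 test days
max(train_dates) < min(test_dates)           # TRUE: no overlap
```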

5 Modeling

fitControl <- trainControl(method = "repeatedcv",
                           number = 5,
                           repeats = 5)

A linear regression model with repeated cross-validation (fitControl) is created using the caret package (“lm” method). The summary results of the model are shown. The R-squared value (about 0.42) is not high; the explanatory power of the model seems low at first glance.

lin_reg <- train(mcp_dollars ~ ., 
                 data = df2_train %>% select(-date, -hour),
                 method = "lm",
                 trControl = fitControl,
                 tuneLength = 5)
summary(lin_reg)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -226.63   -4.09    0.77    4.78  479.82 
## 
## Coefficients:
##                            Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)               16.481142   0.969461  17.000 < 0.0000000000000002 ***
## mcp_dollars_lag168         0.475283   0.004090 116.210 < 0.0000000000000002 ***
## renew_energy_ratio_lag168 -4.660313   0.639829  -7.284    0.000000000000329 ***
## cons_prod_ratio_lag168    -1.445605   0.661822  -2.184             0.028946 *  
## mcp_dollars_lag672         0.250899   0.004071  61.635 < 0.0000000000000002 ***
## renew_energy_ratio_lag672  5.528294   0.646590   8.550 < 0.0000000000000002 ***
## cons_prod_ratio_lag672    -2.537103   0.660776  -3.840             0.000123 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.07 on 51689 degrees of freedom
## Multiple R-squared:  0.4172, Adjusted R-squared:  0.4171 
## F-statistic:  6167 on 6 and 51689 DF,  p-value: < 0.00000000000000022

A decision tree model with repeated cross-validation (fitControl) is created using the caret package (“rpart” method). The RMSE vs complexity parameter plot is shown, and the model with the lowest CP and RMSE is chosen as the final decision tree. The tree plot shows that only the lagged mcp_dollars variables are used in the best decision tree.

dec_tree <- train(mcp_dollars ~ ., 
                  data = df2_train %>% select(-date, -hour),
                  method = "rpart",
                  trControl = fitControl,
                  tuneLength = 5)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
trellis.par.set(caretTheme())
plot(dec_tree) 

fancyRpartPlot(dec_tree$finalModel)

A random forest model with repeated cross-validation (fitControl) is created using the caret package (“ranger” method). Impurity is used as the importance measure, and the number of trees is set to 50 for performance reasons. The final random forest model has mtry = 6 with the extratrees split rule; its cross-validated RMSE is about 9.11 and its R-squared is around 0.61, better than the other methods (linear regression and decision tree).

random_forest <- train(mcp_dollars ~ ., 
                      data = df2_train %>% select(-date, -hour),
                      method = "ranger",
                      trControl = fitControl,
                      num.trees = 50,
                      importance = "impurity")
plot(random_forest)

random_forest$results
##   mtry min.node.size  splitrule     RMSE  Rsquared      MAE    RMSESD
## 1    2             5   variance 9.112124 0.6070154 5.826402 0.4763914
## 2    2             5 extratrees 9.176268 0.6046028 5.898776 0.5065473
## 3    4             5   variance 9.185223 0.5993560 5.878185 0.4714967
## 4    4             5 extratrees 9.107398 0.6078719 5.860768 0.4679191
## 5    6             5   variance 9.304511 0.5884474 5.920130 0.4976521
## 6    6             5 extratrees 9.105426 0.6072850 5.869751 0.4224447
##   RsquaredSD      MAESD
## 1 0.01897971 0.05536528
## 2 0.02121679 0.05789229
## 3 0.01935909 0.05948841
## 4 0.01932822 0.05861577
## 5 0.02198343 0.06275586
## 6 0.01598648 0.05713555
random_forest$finalModel
## Ranger result
## 
## Call:
##  ranger::ranger(dependent.variable.name = ".outcome", data = x,      mtry = min(param$mtry, ncol(x)), min.node.size = param$min.node.size,      splitrule = as.character(param$splitrule), write.forest = TRUE,      probability = classProbs, ...) 
## 
## Type:                             Regression 
## Number of trees:                  50 
## Sample size:                      51696 
## Number of independent variables:  6 
## Mtry:                             6 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        extratrees 
## Number of random splits:          1 
## OOB prediction error (MSE):       84.31614 
## R squared (OOB):                  0.5988944

The candidate models are compared below. The random forest model clearly has a higher R-squared and lower (hence better) MAE and RMSE values. As a result of this analysis, the random forest model is selected.

results = resamples(list(Linear_Regression = lin_reg, Decision_Tree = dec_tree, Random_Forest = random_forest))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: Linear_Regression, Decision_Tree, Random_Forest 
## Number of resamples: 25 
## 
## MAE 
##                       Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## Linear_Regression 6.891402 6.994913 7.027625 7.035182 7.064994 7.272792    0
## Decision_Tree     7.169201 7.264244 7.294622 7.299560 7.336142 7.423114    0
## Random_Forest     5.748468 5.836326 5.870224 5.869751 5.921145 5.959173    0
## 
## RMSE 
##                        Min.   1st Qu.    Median      Mean   3rd Qu.     Max.
## Linear_Regression 10.155073 10.659505 11.025772 11.068541 11.552270 12.21379
## Decision_Tree     10.675618 10.911633 11.362176 11.362024 11.649588 12.22990
## Random_Forest      8.585016  8.766165  9.051721  9.105426  9.248408 10.30312
##                   NA's
## Linear_Regression    0
## Decision_Tree        0
## Random_Forest        0
## 
## Rsquared 
##                        Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## Linear_Regression 0.3570434 0.4001997 0.4207127 0.4178321 0.4404207 0.4639462
## Decision_Tree     0.3536464 0.3705948 0.3818625 0.3861824 0.4073355 0.4226206
## Random_Forest     0.5594306 0.5982082 0.6093993 0.6072850 0.6148209 0.6377483
##                   NA's
## Linear_Regression    0
## Decision_Tree        0
## Random_Forest        0
bwplot(results)

Prediction and residual analysis for the training set of the random forest model is shown. The actual vs predicted plot shows that predicted values are generally close to the actual values, with only a few large deviations. The histogram of residuals is approximately normally distributed around a mean of 0, which is good for the model. Finally, the predicted vs residuals plot shows residuals scattered around 0; the few deviated points are neither numerous nor influential enough to be a concern.

rf_predicted <- predict(random_forest, newdata = df2_train)
plot(rf_predicted, df2_train$mcp_dollars, xlab = "Predicted", ylab = "Actual", main = "Actual vs Predicted Plot for Random Forest Model")
abline(a=0,b=1,col='red', lty = 2)

rf_residuals <- df2_train$mcp_dollars - rf_predicted
hist(rf_residuals, xlab = "Residuals", main = "Residuals Histogram of Random Forest Model")

plot(rf_predicted, rf_residuals, xlab = "Predicted", ylab = "Residuals", main = "Predicted vs Residuals Plot for Random Forest Model")
abline(h = 0, col = "red", lty = 2)

The variable importance of the random forest model is shown. mcp_dollars_lag168 is the most important variable, followed by mcp_dollars_lag672 and renew_energy_ratio_lag168.

sort(random_forest$finalModel$variable.importance, decreasing = TRUE)
##        mcp_dollars_lag168        mcp_dollars_lag672 renew_energy_ratio_lag168 
##                 3792721.6                 2232439.8                 1326302.5 
## renew_energy_ratio_lag672    cons_prod_ratio_lag672    cons_prod_ratio_lag168 
##                 1002438.9                  909557.9                  850075.4

6 Forecast for the Test Period

For the test period, the target variable (mcp_dollars) is forecast using the predict function. Prediction and residual analysis for the test set of the random forest model is shown. The actual vs predicted plot shows that predicted values are generally close to the actual values, with only a few large deviations. The histogram of residuals is approximately normally distributed around a mean of 0. Finally, the predicted vs residuals plot shows residuals scattered around 0; the few deviated points are not a concern.

rf_predicted_test <- predict(random_forest$finalModel, data = df2_test)
plot(rf_predicted_test$predictions, df2_test$mcp_dollars, xlab = "Predicted", ylab = "Actual", main = "Actual vs Predicted Plot for Random Forest Model with Test Set")
abline(a=0,b=1,col='red', lty = 2)

rf_residuals_test <- df2_test$mcp_dollars - rf_predicted_test$predictions
hist(rf_residuals_test, xlab = "Residuals", main = "Residuals Histogram of Random Forest Model")

plot(rf_predicted_test$predictions, rf_residuals_test, xlab = "Predicted", ylab = "Residuals", main = "Predicted vs Residuals Plot for Random Forest Model with Test Set")
abline(h = 0, col = "red", lty = 2)

Lastly, the RMSE of the test-set predictions is shown. The value is relatively low, which is also good for the model.

RMSE(df2_test$mcp_dollars, rf_predicted_test$predictions)
## [1] 5.292324
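As a footnote on the metric itself: caret::RMSE is just sqrt(mean((obs - pred)^2)), and the squaring makes it symmetric in its two arguments, so passing the observations first (as above) gives the same number as passing the predictions first. A minimal sketch with toy vectors:

```r
obs  <- c(40, 45, 50)   # toy observed values
pred <- c(42, 44, 49)   # toy predictions

# Errors are -2, 1, 1, so the mean squared error is 6 / 3 = 2
sqrt(mean((obs - pred)^2))   # sqrt(2), approximately 1.41
```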